basilar membrane is that of a tapped delay line. It is
shown that the same theory may be applied to speech

CHAPTER 1
INTRODUCTION 


1.1 This report describes the latest results of research in auditory
localization, which began in 1960.

Three significant results are reported. First, the original hypothesis
regarding time delays in auditory localization has been extended to a theory
of perception supported by mathematical models and consistent with
observations and experiments. Second, satisfactory electrostatic headphones
were designed and constructed; these proved to be the most difficult
component to develop in a system to reproduce localization information
accurately. Third, certain theoretical results were applied to speech
recognition, and devices were subsequently developed to establish a basic
man-to-porpoise communication link.

The theory reported here is based on the fundamental concept that
we derive knowledge of our environment by mentally inverting a
transformation introduced on the observed space by the mechanism of
perception. The form of the transformation ascribed to human audition,
speech, color vision, etc. is that of time delays. The model described for
inverting such a transformation is realizable in the human nervous system.
However, to do so, different functions must be assigned to certain elements
of perception mechanisms. For example, in hearing, the basilar membrane is
no longer considered to function as a resonant structure, but rather as a
delay line from which the inverse transform computation is made. Such
radical departure from the popular concepts implies a rethinking of the
traditions by which new theories are judged and evaluated. It has been our
experience that the theory presented here, while providing an understanding
of many observations, is marked with the tag of controversy. Nevertheless,
while traditional thought holds sway, developments continue to indicate the
correctness of our thought. Recently, it was reported that intelligible
speech may be transmitted on a 10-cps bandwidth. This accomplishment is
achieved by a device that is said to be an electronic representation of the
human ear. The cochlea and basilar membrane are simulated by a delay line
with detectors tapped along its length. This is the function of the basilar
membrane in the theory proposed.

1.2 A system for transferring localization information was improved
by refining the design and construction of the electrostatic headphones
reported earlier. The development was made necessary by the lack of
available headphones of the requisite bandwidth. While the design reported
here has proved effective in accurate transfer of localization information,
further improvements can be made.

1.3  The  work  in  porpoise  communication  (ref.  3.4)  was  continued  with 
emphasis  on  equipment  improvements  and  the  development  of  a  meta-language. 
CHAPTER 2

THEORY 

by 

Dwight  W.  Batteau 


2.1 Summary

During this contract, theoretical aspects of the derived attention
functions for human hearing were examined and hypotheses were formed
concerning means of realizing such functions in the human nervous system.
The transformations applicable to the role of the pinna in localization,
speech formation and recognition, room reverberation, and object
identification, as performed by the dolphin, were examined. Estimates of
improvement in signal selection by attention and parameters in recognition
were examined.

2.2 Pinnae Attention

The  term  "cocktail  party  effect"  has  been  applied  to  the  ability  of 
a  man  to  pay  attention  selectively  to  a  desired  conversation  (or  other 
sound)  in  the  presence  of  other  conversations  or  sounds. 

An expression can be written for the function of the pinna,
equation (2.2.1):

$$ H(s) = P(s) \sum_{n=0}^{N} a_n e^{-s\tau_n} \tag{2.2.1} $$

$H(s)$ = sound at eardrum
$P(s)$ = sound reaching the pinna
$a_n$ = coefficient of reflection for the nth delay path
$\tau_n$ = delay in the nth delay path
It is convenient to normalize equation (2.2.1) by assigning the following
values:

$$ a_0 = 1, \qquad \tau_0 = 0 $$

An attention function applicable to equation (2.2.1) can be constructed
as follows in equation (2.2.2):

$$ A(s) = \sum_{n=0}^{N} b_n e^{-s(\tau_M - \tau_n)} \tag{2.2.2} $$

This function is constructed by reversing the ordering of the delays in
equation (2.2.1). If (2.2.2) is applied to equation (2.2.1) the following
result is obtained, equation (2.2.3):

$$ H(s)\,A(s) = P(s)\, e^{-s\tau_M} \left[ \sum_{n=0}^{N} a_n b_n + \sum_{j=0}^{N} \sum_{\substack{k=0 \\ j \neq k}}^{N} a_j b_k\, e^{-s\tau_j} e^{+s\tau_k} \right] \tag{2.2.3} $$


Diagrammatically, if P(s) is assumed to be a pulse, equation (2.2.1)
produces the result shown in Figure 2-1 for N = 3.

Figure 2-1. The pinna transform of a pulse, assuming four paths
(horizontal axis: time).
Applying equation (2.2.3) to the same situation results in the signal
sketched in Figure 2-2.

Figure 2-2. The result of applying equation (2.2.3).


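The central pulse in Figure 2-2 results from the coherent term in equation
(2.2.3), as given below.

$$ P(s) \sum_{n=0}^{N} a_n b_n \tag{2.2.4} $$

The pulses on either side of the central pulse in Figure 2-2 result from
the cross terms of equation (2.2.3).

$$ P(s) \sum_{j=0}^{N} \sum_{\substack{k=0 \\ j \neq k}}^{N} a_j b_k\, e^{-s\tau_j} e^{+s\tau_k} \tag{2.2.5} $$

In both (2.2.4) and (2.2.5) the common delay factor $e^{-s\tau_M}$ of
equation (2.2.3) is omitted.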


If a second sound source Q(s) is located at a separate place from
P(s), the transformation is the same form, but has different reflection
coefficients and delays. This is presented as equation (2.2.6).

$$ J(s) = Q(s) \sum_{k=0}^{N} c_k e^{-s\tau'_k} \tag{2.2.6} $$

Here $c_k$ and $\tau'_k$ denote the reflection coefficients and delays
for the second source.

Figure 2-3. The result of applying equation (2.2.2) to equation (2.2.6).


The  significant  difference  between  the  two  results  is  the  large  central 
spike  produced  from  the  reverberant  pulse  train  when  the  transformations 
are  matched. 
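The contrast between the matched and mismatched cases can be illustrated with a minimal sketch. It assumes unit coefficients, a unit pulse as the stimulus, and integer delays in arbitrary time units; the particular delay values and the function name `attention_output` are hypothetical and are not taken from the report.

```python
from collections import Counter

def attention_output(path_delays, attended_delays):
    """Pulse heights at each output time when a unit pulse passes through
    a set of delay paths and then through an attention function built
    from attended_delays (all coefficients assumed to be unity)."""
    tau_m = max(attended_delays)
    out = Counter()
    for tj in path_delays:
        for tk in attended_delays:
            # Each path delay tj meets each attention delay (tau_m - tk).
            out[tj + tau_m - tk] += 1
    return dict(sorted(out.items()))

# Hypothetical delays for two source positions, four paths each (N = 3).
attended = [0, 3, 7, 12]
matched = attention_output([0, 3, 7, 12], attended)
mismatched = attention_output([0, 1, 11, 17], attended)

print("matched peak:", max(matched.values()))
print("mismatched peak:", max(mismatched.values()))
```

With these illustrative delays the matched case piles four unit pulses onto one instant, while the mismatched case leaves every pulse separate.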

Let us assume that the two signals P(s) and Q(s) are now independent
sources of white noise. By definition

$$ \int_{-\infty}^{\infty} P(t)\,P(t+\tau)\,dt = 0, \qquad \tau > 0 \tag{2.2.8} $$

$$ \int_{-\infty}^{\infty} Q(t)\,Q(t+\tau)\,dt = 0, \qquad \tau > 0 \tag{2.2.9} $$

$$ \int_{-\infty}^{\infty} Q(t)\,P(t)\,dt = 0 \tag{2.2.10} $$

Thus the delayed signals represent independent signal powers because
of the zero value of the cross correlations.


In order to compute the effect of signal processing of the kind
described, we can make initially two assumptions:

(1) All the coefficients are unity.

(2) The addition is equivalent to adding voltages in an
electrical signal.


If there are N terms in the reflection system, then there is a total
of $N^2$ terms in the correlation output. In the case of the coherent term,
N of these are added. The N coherent terms add as voltages and so
contribute a power of $N^2$, while the remaining $N^2 - N$ terms add as
independent powers. Thus the resultant powers can be computed.
For the monaural or single channel, the power due to P(s) is as follows:

$$ {}_1U_p = N^2 + N^2 - N \tag{2.2.11} $$

${}_1U_p$ = power due to P(s) after the attention
transformation in a single channel

For the same case, the power due to Q(s) is as follows:

$$ {}_1U_q = N^2 \tag{2.2.12} $$

${}_1U_q$ = power due to Q(s) after the attention to
P(s) transformation in a single channel

The resultant power ratio is expressed in equation (2.2.13):

$$ {}_1R_{p,q} = \frac{{}_1U_p}{{}_1U_q} = 2 - \frac{1}{N} \tag{2.2.13} $$

${}_1R_{p,q}$ = the ratio of powers in a single channel
due to two signals, one of which (p)
has undergone an attention transformation

In the binaural case, or for two channels, the resultant powers due to
attention to one of the signals are as follows:

$$ {}_2U_p = 4N^2 + 2N^2 - 2N \tag{2.2.14} $$

${}_2U_p$ = power due to P(s) after attention transformations
in two channels

$$ {}_2U_q = 2N^2 \tag{2.2.15} $$

${}_2U_q$ = power due to Q(s) after attention transformations
in two channels

The resultant power ratio is expressed in equation (2.2.16):

$$ {}_2R_{p,q} = \frac{{}_2U_p}{{}_2U_q} = 3 - \frac{1}{N} \tag{2.2.16} $$


The limits for the two cases are 2 and 3 respectively, or 3 db and 4.8 db.
This indicates the limits of selection corresponding to differences in
power of two separated sound sources, differently transformed by the
pinnae or the environment.

If a comparison is made for two channel selection on time difference
alone in an anechoic environment, the two channel ratio of powers is 2 or
a limit of 3 db. Thus reverberation can be used to improve selection in
hearing. Both the reverberation due to the pinnae and that due to the
environment can be used.
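The limiting ratios quoted above can be checked numerically. The sketch below assumes the counting argument of this section: unit coefficients, coherent terms adding as voltages, and all other terms adding as independent powers. The function names are illustrative only.

```python
import math

def monaural_ratio(n):
    """Single-channel power ratio: N coherent terms add as voltages,
    the remaining N^2 - N terms add as powers; the unattended source
    contributes N^2 independent terms."""
    attended = n ** 2 + (n ** 2 - n)
    unattended = n ** 2
    return attended / unattended

def binaural_ratio(n):
    """Two-channel power ratio: 2N coherent terms add as voltages."""
    attended = (2 * n) ** 2 + 2 * (n ** 2 - n)
    unattended = 2 * n ** 2
    return attended / unattended

def to_db(power_ratio):
    return 10.0 * math.log10(power_ratio)

for n in (2, 4, 16, 256):
    print(n, round(to_db(monaural_ratio(n)), 2), round(to_db(binaural_ratio(n)), 2))
```

As N grows, the two ratios approach 2 and 3, which correspond to the 3 db and 4.8 db limits stated in the text.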


2.3 Function Construction

The ideal detector is one which does not alter the characteristics
of the signal. An ideal microphone, for example, would have no resonances
and produce no reflections. If we assume that the Organ of Corti is the
sonic detector for hearing, it should approach these requirements in view
of its excellent performance. In this view, the cochlea provides a model
of acoustic termination. If the system is viewed as a straight element,
it shows the taper of acoustic terminations (as in anechoic chambers) and
the continuous change in apparent impedance towards the small end.
This is sketched in Figure 2-4.


Figure  2-4.  The  cochlea  as  an  acoustic  termination. 

While the high frequencies could be terminated in this manner, the
scale is small for low frequencies. However, the flexible central membrane
and the round window (as a pressure release orifice) can provide anechoic
termination for the low frequencies. The arrangement of dense and spongy
bone surrounding the cochlea also provides a model of acoustic isolation.
From an engineering viewpoint, the detector is isolated, protected and
terminated anechoically. From this viewpoint, there is no point in
examining mechanical resonance as a model of tone or pitch detection.
However, the anechoic model provides an ideal delay line, distributing
the signal over the nerve endings on the basilar membrane. If we consider
20 cps as the lowest frequency perceived as a tone, and if computation were
to be performed using the Organ of Corti as a delay line, then the delay
between the oval window and the termination should be approximately
12 milliseconds (minimum), the duration of a quarter wave of the tone.

If the length is taken as 48 mm, the resultant velocity is only 4 meters
per second, which is slow compared with most infinite media, but the
construction of the cochlea is such that lower velocities than those in
infinite media are possible (the flexible central membrane as a low modulus
element, for example). Von Bekesy (ref. 2.1) reports that the velocity
varies with position, and that approximately 5 milliseconds delay occurs
between the oval window and a point 34 mm distant. Thus the possibility of
12 milliseconds to termination is reasonable (a linear extrapolation of
Figure 11-53, page 458, implies about 40 mm, ref. 2.1).
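The arithmetic behind the 12 millisecond and 4 meters per second figures can be sketched as follows. The function names are hypothetical; the 20 cps, 48 mm and 12 ms values are those quoted in the text.

```python
def quarter_wave_delay_ms(frequency_cps):
    """Duration of a quarter period of a tone, in milliseconds."""
    period_ms = 1000.0 / frequency_cps
    return period_ms / 4.0

def propagation_velocity_m_per_s(length_mm, delay_ms):
    """Average velocity along a delay line; mm per ms equals m per s."""
    return length_mm / delay_ms

delay = quarter_wave_delay_ms(20)               # quarter wave of a 20 cps tone
velocity = propagation_velocity_m_per_s(48, 12)  # rounded delay used in the text
print(delay, velocity)
```

The quarter wave of a 20 cps tone is 12.5 ms, which the text rounds to 12 ms before dividing the 48 mm length by it.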

With the mechanics of the detector assumed to be suitable to the
mathematical model, it becomes possible to continue the examination from
this viewpoint. We should point out that the theory here presented involves
a time distributed sensor, with anechoic characteristics, feeding a computer
system using only time delays, attenuations and signed additions (plus or
minus). This is different from Ewald's theory as described by Von Bekesy
as 'a sinusoidal movement of the stapes sets up a series of standing
waves along the basilar membrane.' It is essential to our theory
that anechoic properties dominate.

In view of the correlation lengths concerned, 2 to 300 microseconds
for localization, 400 to 2600 microseconds for speech, and 3 to 40
milliseconds for reverberation not discernible as separate echoes, it seems
entirely possible that the mathematical functions for attention and
recognition in these domains can be set up in the nervous system directly
at the basilar membrane. Where longer times are concerned, and multiple
trunk or feedback models are examined, the computation arrangement must be
within the innervating system or in the cortex. It is our present view that
localization, because of its survival function, most likely is performed
near the nerve endings.

To construct the function which would provide attention to a particular
location with the Organ of Corti as a delay line, the connections could be
arranged as sketched in Figure 2-5.

Figure 2-5. Construction of pinnal attention function at the basilar
membrane.

Attention functions may be constructed on any interval along the Organ of
Corti which provides the requisite delay lengths, so that other functions
may also be provided applicable to reverberation or the character of the
sound.


The recognition of a particular pure tone by construction of an attention
function suggests an interesting process. If a maximum correlation length,
$\tau_N$, is chosen, then the correlation sequence can be written as in
equation (2.3.1):

$$ C_T(s) = \sum_{n=0}^{N} e^{-s(\tau_N - 2^{-n}\tau_N)} = e^{-s\tau_N} \sum_{n=0}^{N} e^{+s\,2^{-n}\tau_N} \tag{2.3.1} $$

$C_T(s)$ = pure tone correlation function
This equation represents correlation by octaves, and the power (N-1)
defines the minimum difference in interval which will correlate, or the
pitch of the tone. Consider the definition of a pure tone as the repetition
of a single cycle, as given in equation (2.3.2):

$$ H(s) = P_T(s) \sum_{n=0}^{\infty} e^{-snT} \tag{2.3.2} $$

$P_T(s)$ = one cycle pure sine wave of period $T = 2^{-N}\tau_N$

Since the function is periodic on the interval $2^{-N}\tau_N$, the result
of the correlation applied to the signal is

$$ M(s) = (N + 1)\, P_T(s) \tag{2.3.3} $$

There are several consequences of this consideration:

1. The octave decision process suggests an octave identity
in pure tones.

2. Higher frequencies have a greater possible correlation (N + 1)
for a given total delay length.

3. Resolution of higher frequencies will be poorer than median
length frequencies ( approaches 1) for a fixed maximum
delay length.

4. Resolution of low frequencies will be poorer than median
length frequencies (N + 1 approaches 1) for a fixed maximum
delay length.

One possible consequent adaptive process is the selection of a
computational delay length to optimize perception and resolution in the
range of consideration. This suggests that continued attention to high
frequencies would permit improvement of resolution.
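The octave behaviour described above can be illustrated numerically. This is a sketch under stated assumptions: the attention delays are taken to be octave spaced, $\tau_N(1 - 2^{-n})$ for n = 0 to N, time is measured in units of the shortest interval $2^{-N}\tau_N$, and the response to a steady tone is taken as the magnitude of the summed delayed phasors. The function names are hypothetical.

```python
import cmath

def octave_delays(n_max):
    """Delays tau_N * (1 - 2**-n) in units of the shortest interval,
    so that tau_N equals 2**n_max units (an assumed reading of the
    octave correlation sequence)."""
    tau_n = 2 ** n_max
    return [tau_n - 2 ** (n_max - n) for n in range(n_max + 1)]

def correlation_gain(freq, n_max):
    """Magnitude of the summed delayed copies of a steady tone whose
    frequency is given in cycles per shortest interval."""
    total = sum(cmath.exp(-2j * cmath.pi * freq * d) for d in octave_delays(n_max))
    return abs(total)

print(octave_delays(3))
for f in (1.0, 2.0, 1.5):
    print(f, round(correlation_gain(f, 3), 6))
```

Under these assumptions a tone whose period equals the shortest interval sums to N + 1, the tone an octave above sums to the same value, and a tone between the two sums to less.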
While the octave decision process is one possibility, there are
two others. One simply determines the shortest correlation length, or
period of the signal; the other proceeds from a minimum length cycle by
cycle to a suitable correlation. There are, of course, a large but finite
number of ways of performing the measurement to provide both attention
and decision. This suggests that the harmonic relations of music may
involve processes providing attention and decision in economical ways.
The octave attention and decision process is the simplest. Any power of
two above the longest interval will correlate in the same arrangement, but
the shortest difference will determine the requirement for decision. It is
easily observed that ordinarily, in simple harmonies, the highest note
provides the melodic line.

2.4 Mechanisms of Perception

When we consider the rationality of the models provided and the
measurable behavior consistent with them, we are provoked to inquire
into the physiological mechanism, the biophysics as contrasted to
mathematics and mechanics. Historically, the electrical potential
spikes in the nerves, a consequence of ion concentration shifts, have
been considered a likely carrier of the information transmitted by the
nerves. However, when we require mathematical functions to be
constructed for attention and recognition which require high channel
capacity for their performance, and use the distribution of nerve elements
in that construct, the capacity of the electrical signals to fulfill the
requirement becomes questionable.

We are able to form a hypothesis which will provide consistency
by the use of more recent developments in biophysics and physics by
assuming that transition of electrons between energy levels in the
organic molecules provides information, propagated by the photon emitted.
The hypothesis assumes that there are many energy levels, metastable
with significant half lifetimes, which can be filled by metabolic
processes, and which can be stimulated into transition by any of the
sensed phenomena (sound, light, heat, etc.). The transition is made
by the emission of a photon, which stimulates transition in adjacent
states and thus propagates. The electrical spikes, by hypothesis, are
associated with the restoration of the occupancy of higher energy levels
(as in the laser or maser amplifier) by metabolic process.

While this model remains to be investigated, it provides for high
channel capacity, few parallel channels (two are sufficient if the pumping
spikes are brief) for continuous operation, encoding of particular source
signals, computational networks (for K factors greater than unity), and
selectivity to stimulus. It also provides a rationale for the myelin sheath
of nerves as a photon path. There are also a number of possibilities of
no interest to our present problem, but of general interest; among these
are mediation of hormone production by photon catalysis, monocell
sensation and computation, and spectrum perception by multilevel systems
(as in color vision).

2.5 Speech

If we examine the mechanics of speech production, we again find
reverberation. The vocal pulse, or aerodynamic noise as in a whisper,
provides the stimulus. The organization of the vocal tract performs the
transformation. The classical method of speech examination by 'formant
frequencies' suffers from two deficiencies.

1. Power density Fourier analysis omits any time dependent
relationships.

2. Absolute frequency characterization does not permit scale
changes in speech to remain intelligible.

The first of the deficiencies is demonstrated by the necessity for
simultaneous pulsing of reconstituting filters in vocoder work. The
second of the two deficiencies, as shown by speed changes in taped
speech, indicates that recognition depends on a dimensionless quantity
of relationship and not on an absolute dependence on given frequencies.

We may resolve these deficiencies by considering the transient character
of speech formation, and the construction of dimensionless relationships
for recognition.

In the most general representation, equation (2.5.1) applies.

$$ H(s) = P(s) \sum_{n=0}^{N} a_n e^{-s\tau_n} \tag{2.5.1} $$

$H(s)$ = speech code element
$P(s)$ = stimulus
$a_n$ = coefficient of reflectivity for the nth delay
$\tau_n$ = delay in arrival of the stimulus through
the nth path

Although  the  general  expression  is  necessarily  exact,  it  is  not 
necessarily  the  most  useful.  If  we  consider  that  the  vocal  tract  of 


speed over a relatively wide range (+1.5:1) retain intelligibility, we
may inquire concerning the simplest dimensionless characterization.
Since the absolute delay is not measurable in the perception in question,
and would vary with the distance of the hearer from the talker, we can
normalize the absolute representation by considering

$$ a_0 = 1 \tag{2.5.2} $$

$$ \tau_0 = 0 \tag{2.5.3} $$
The first significant perception is then in the normalized signal. In
an absolute system, the time between the first arrival and the second,
$\tau_1$, could provide a code. In a dimensionless system the ratio between
$\tau_1$ and the next longest delay can provide a code. The simplest
dimensionless characterization can then be written

$$ H(s) = P(s) \sum_{n=0}^{2} a_n e^{-s\tau_n}, \qquad a_0 = 1, \quad \tau_0 = 0 \tag{2.5.4} $$

Recognition is provided by ascertaining the relative values of $\tau_1$ and
$\tau_2$, as in equation (2.5.5):

$$ \rho_{1,2} = \frac{\tau_1}{\tau_2} \tag{2.5.5} $$

$\rho_{1,2}$ = character of the speech code element
$\tau_1$ = shortest delay
$\tau_2$ = second shortest delay


In equation (2.5.1), H(s) was carefully stated as "speech code
element" as contrasted to the linguistic term "phoneme." It is easy to
observe, by making a tape loop of a voiced sound such as "ah," that the
dynamics of speech are necessary to its ordinary characterization. The
tape loop of a voiced sound, when played continually, quickly becomes
a machine-like buzz with only traces of the subjective natural sound
remaining. We thus assume that 'speech' is produced by sequences of
the "speech code elements," consonant sounds being produced by rapid
variation and vowel sounds being produced by relatively slow variation.
Thus the characterization of the "speech code element" is but the first
step in recognition of meaningful speech.

A variety of experiments were undertaken once the hypotheses
were formed. The first concerned what could be done to normal speech
which would destroy or which would retain the basic coding. If we
assume that the vocal pulse is similar to the sketch of Figure 2-6, we can
make some predictions.


Figure  2-6.  An  assumed  vocal  pulse. 


This pulse can be approximately described by the following time function,
equation (2.5.6):

$$ P(t) = U(t)\left[1 - e^{-k_1 t}\right] e^{-k_2 t} \tag{2.5.6} $$

$U(t)$ = unit function
       = 0 : t < 0
       = 1 : t > 0
$k_1$ = factor of the steep rise
$k_2$ = factor of the slower fall

If we examine the first derivative of the function for t > 0, we obtain the
following equation, (2.5.7):

$$ \frac{dP}{dt} = e^{-k_2 t}\left[(k_1 + k_2)\,e^{-k_1 t} - k_2\right] \tag{2.5.7} $$

If we set the derivative equal to zero we obtain equation (2.5.8):

$$ (k_1 + k_2)\,e^{-k_1 t} = k_2 \tag{2.5.8} $$
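The zero of the derivative can be located in closed form. The sketch below assumes the pulse has the form U(t)[1 - exp(-k1 t)] exp(-k2 t), with k1 governing the steep rise and k2 the slower fall; the numerical rate constants are hypothetical and chosen only for illustration.

```python
import math

def vocal_pulse(t, k1, k2):
    """Assumed pulse shape: fast rise (k1) multiplied by a slower fall (k2)."""
    if t < 0:
        return 0.0
    return (1.0 - math.exp(-k1 * t)) * math.exp(-k2 * t)

def pulse_slope(t, k1, k2):
    """First derivative of the assumed pulse for t > 0."""
    return math.exp(-k2 * t) * ((k1 + k2) * math.exp(-k1 * t) - k2)

def peak_time(k1, k2):
    """Time at which the slope is zero: (k1 + k2) exp(-k1 t) = k2."""
    return math.log((k1 + k2) / k2) / k1

k1, k2 = 5000.0, 500.0   # hypothetical rate constants, per second
t0 = peak_time(k1, k2)
print(t0, pulse_slope(t0, k1, k2))
```

Solving the zero-slope condition for t gives t = ln((k1 + k2)/k2)/k1, so a steep rise (large k1) places the zero close to the start of the pulse.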
Now consider the process of taking the derivative of the speech signal,
H(s), in equation (2.5.4) to obtain sH(s) in equation (2.5.9):

$$ sH(s) = sP(s) \sum_{n=0}^{2} a_n e^{-s\tau_n} \tag{2.5.9} $$

It is apparent that taking the derivative following transformation is
equivalent to applying a signal equivalent to the derivative of the vocal
pulse (necessarily since these are linear transformations). Thus the
time derivative of the speech should be intelligible if the characterization
is transformation dependent, not structure dependent. Conversely, if the
speech is structure dependent, not transform dependent, the result should
not be intelligible. (We knew this before by whispered speech.) More
interesting, the zero of (2.5.8) should also transform the same way.

Thus clipping all of the speech but the zero crossings after differentiation
should alter the recognition little if the following condition is met,
equation (2.5.10)


t 


01 


« 


(2.5.10) 


We also know that the recognition result is obtained (ref. 2.2).


If we examine higher derivatives of the speech signal, we can
write a general expression, equation (2.5.11):

$$ s^m H(s) = s^m P(s) \sum_{n=0}^{2} a_n e^{-s\tau_n} \tag{2.5.11} $$

In consequence, all orders of derivatives should be intelligible and the
increase in zeros of the higher derivatives may contribute to greater
redundance in clipped speech recognition. The second derivative was
tested and the results are good.
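The linearity argument, that differentiating the transformed signal is the same as transforming the differentiated stimulus, can be checked in discrete time. In this sketch a first difference stands in for the derivative and integer sample delays stand in for the path delays; the sample values, delays, coefficients and function names are all hypothetical.

```python
def delay_sum(x, delays, coeffs):
    """Sum of delayed, weighted copies of x (a discrete reflection system)."""
    y = [0.0] * (len(x) + max(delays))
    for d, a in zip(delays, coeffs):
        for i, v in enumerate(x):
            y[i + d] += a * v
    return y

def first_difference(x):
    """Discrete stand-in for the time derivative, with zeros assumed
    before the first and after the last sample."""
    padded = [0.0] + list(x) + [0.0]
    return [padded[i + 1] - padded[i] for i in range(len(padded) - 1)]

pulse = [0.0, 3.0, 5.0, 4.0, 2.0, 1.0]   # hypothetical stimulus samples
delays = [0, 2, 5]                        # hypothetical path delays, in samples
coeffs = [1.0, 0.5, 0.25]                 # hypothetical reflection coefficients

a = first_difference(delay_sum(pulse, delays, coeffs))
b = delay_sum(first_difference(pulse), delays, coeffs)
print(a == b)
```

Both orders of operation give the same sequence, so the delay structure survives differentiation unchanged.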

We may now assume that the problem of recognition is to find a
recognizing transformation, of the form previously given, which matches
the characterizing transformation of the vocal tract. To do this we need a
measure to apply. If we consider the recognition transformation applied
to the characterizing transformation, we have equation (2.5.12):

$$ A(s) = P(s) \sum_{n=0}^{2} a_n e^{-s\tau_n} \sum_{n=0}^{2} a_{An}\, e^{-s(\tau_M - \tau_{An})} $$

$$ A_R(s) = e^{-s\tau_M} P(s) \left[ \sum_{n=0}^{2} a_n^2 + \sum_{j=0}^{2} \sum_{\substack{k=0 \\ j \neq k}}^{2} a_j a_k\, e^{-s\tau_j} e^{+s\tau_k} \right] \tag{2.5.12} $$

$A(s)$ = H(s) followed by a recognition transform
$A_R(s)$ = the form of A(s) upon recognition

To simplify, assume all $a_n$ are unity. The power before recognition
is given by equation (2.5.13):

$$ U_1 = 3 \times 1 = 3 \tag{2.5.13} $$

$U_1$ = power before recognition

After application of the recognition function, but before the $\tau_n$
match, the power will be given by equation (2.5.14):

$$ U_2 = 9 \times 1 = 9 \tag{2.5.14} $$

$U_2$ = power after computation but without
recognition

When the $\tau_n$ match, the power will rise to the value given by equation
(2.5.15):

$$ U_3 = 3^2 + 6 \times 1 = 15 \tag{2.5.15} $$
To compute the recognition delay values, we need only search
for the set of $\tau_{An}$ which changes the resultant signal power ratio
for the recognized signal prior to post processing from 3 to 5. There are
other methods which have been discussed in prior reports, but this is
a simple measuring method, dimensionless on absolute loudness.

Even though there is a signal which is louder than the one recognized,
the sharp increase in value at recognition should provide knowledge
of the basic speech code element. Once the basic element is
recognized, the program of symbols to provide meaning may be
constructed.
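The change in power ratio from 3 to 5 can be reproduced with a short sketch. It assumes unit coefficients, a white-noise stimulus (so that terms arriving at different times add as powers and terms arriving together add as voltages), and integer delays in arbitrary units. The delay values and the function name `power_ratio` are hypothetical.

```python
from collections import Counter

def power_ratio(tract_delays, recog_delays):
    """Ratio of power after the recognition function to power before it,
    for a white-noise stimulus and unit coefficients. Terms that land on
    the same output delay add as voltages; the rest add as powers."""
    tau_m = max(recog_delays)
    combined = Counter()
    for tj in tract_delays:
        for tk in recog_delays:
            combined[tj + tau_m - tk] += 1
    power_after = sum(height * height for height in combined.values())
    power_before = len(tract_delays)   # distinct tract delays, unit power each
    return power_after / power_before

tract = [0, 4, 11]                            # hypothetical speech code element
print(power_ratio(tract, [0, 5, 13]))         # recognition delays not matched
print(power_ratio(tract, [0, 4, 11]))         # recognition delays matched
```

A search over candidate delay sets would look for the jump in this ratio; the measure depends only on relative powers, not on absolute loudness.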

It appears that almost any linear transformation applied to the
speech waveform will be recognizable, so long as the time correlation
points are recognizable. This may be expressed in equation (2.5.16).

$$ T(s)\,H(s) = T(s)\,P(s) \sum_{n=0}^{2} a_n e^{-s\tau_n} \tag{2.5.16} $$

Thus  this  form  of  recognition  is  a  remarkably  reliable  one.  However, 
if  the  equivalent  stimulus,  T(s)P(s),  is  smoothed,  so  that  sharp 
location  of  the  delay  times  is  lessened,  then  recognition  will  be 
decreased.  Smoothing  of  waveforms  can  be  produced  by  phase  shift 
alone.  Thus  processing  of  the  signal  by  suitable  phase  shift  only 
systems  to  produce  smoothing  should  reduce  intelligibility,  without 
alteration  of  the  power  density  spectrum. 


To construct such transformations the following equation may
be used, (2.5.17):

$$ T_p(s) = \frac{R(s) - j\,I(s)}{R(s) + j\,I(s)} \tag{2.5.17} $$

$R(s)$ = real part of characteristic
$I(s)$ = imaginary part of characteristic
$T_p(s)$ = phase shift only transformation
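That a transformation of this form alters phase but not magnitude can be checked numerically. The sketch below assumes a second order characteristic s^2 + (w0/Q)s + w0^2 evaluated at s = jw; this characteristic, its parameter values and the function names are illustrative assumptions and do not describe the circuit of Figure 2-7.

```python
import cmath
import math

def characteristic(s, omega0, q):
    """Assumed second order characteristic polynomial."""
    return s * s + (omega0 / q) * s + omega0 * omega0

def phase_shift_only(omega, omega0, q):
    """Ratio (R - jI)/(R + jI), with R and I the real and imaginary
    parts of the characteristic at s = j*omega."""
    d = characteristic(1j * omega, omega0, q)
    return (d.real - 1j * d.imag) / (d.real + 1j * d.imag)

omega0 = 2.0 * math.pi * 1000.0   # hypothetical centre frequency, rad/s
q = 2.0                           # hypothetical quality factor
for f_cps in (100.0, 500.0, 1000.0, 2000.0, 8000.0):
    t = phase_shift_only(2.0 * math.pi * f_cps, omega0, q)
    print(f_cps, round(abs(t), 12), round(math.degrees(cmath.phase(t)), 1))
```

The magnitude stays at unity for every frequency while the phase varies, so the power density spectrum is unchanged.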


A typical second order circuit for this purpose is shown in Figure 2-7.

$R_L$ = resistance of inductance L
L = inductor
C = capacitor
R = external resistance

Figure 2-7. A typical second order phase shift only circuit.

When  speech  is  passed  through  a  sequence  of  such  filters  to  produce 
phase  smoothing,  the  result  can  be  almost  completely  unintelligible. 

Our  experiments  also  showed  that  full  wave  rectified  speech  is 
almost  completely  unintelligible,  but  that  a  small  imbalance  of  3  db 
in  the  two  sides  was  sufficient  to  restore  intelligibility  somewhat. 

If we write the transformation in the time domain, as in equation (2.5.18),
we can examine the result of full wave rectification.

$$ H(t) = \sum_{n=0}^{2} a_n P(t - \tau_n) \tag{2.5.18} $$
Full wave rectification may be written as in equation (2.5.19):

$$ |H(t)| = \left| \sum_{n=0}^{2} a_n P(t - \tau_n) \right| \tag{2.5.19} $$

In the electronic amplification of speech signals, conditions are such that

$$ \int_{t_l}^{t_u} H(t)\,dt = 0 \tag{2.5.20} $$

$t_l$ = lower bound of sample time
$t_u$ = upper bound of sample time
$t_u - t_l$ = pitch period, or time between vocal pulses.

In  this  case,  equation  (2,5.21)  is  true 

2 


H(t) 

^  y 

a  P(t  -  T  ) 

n  n 

n  =  0 


Thus  the  characterizing  transformation  no  longer  applies,  for 

2 

I  “ST 

i  )  1 3 1  f  " 

I  '  I  nl 
2

n  =  0 


^  A 


n  =  0 


Laplace  transformation 


(2.5.21) 


(2.5.22) 


;  a  P(t  -  T  ) 
n  n 

i  n  =  U 


2 

i  n=0 


a  P(t  -  T  ) 
n  n 


(2,  5.23) 
It is believed, but not demonstrated mathematically, that the half wave
case preserves the transformation.

In synthesis, we found it possible to produce well recognized
voiced sounds by pulse spacing. The initial attempts simply varied the
spacing between two pulses as sketched in Figure 2-8.


Figure  2-8.  Two  pulse  synthesis  of  speech  sounds. 

The two pulse synthesis could be run through the sequence "eeh" to "ah"
by varying the spacing. If the spacing were programmed to vary smoothly
back and forth, the sound sequence "eeh - ih - eh - ah - eh - ih - eeh"
could be repeated, but would not be stationary on any interval. For
example, 400 μsec to 1100 μsec or 800 μsec to 2200 μsec, or any interval
of comparable length in between, would sound like the same sequence of
speech sounds.

When a dimensioning pulse was placed at 260 μsec, as shown in
Figure 2-9, the sequence proceeded from "eeh" to "ooh" on the interval
400 μsec to 2600 μsec, and the character of the sound would be
stationary with respect to the spacing.


Figure 2-9. Addition of a dimensioning pulse to the synthesis of
speech sound. The dimensioning pulse is fixed at 260 μsec; the
second pulse spacing is variable from 400 μsec to 2600 μsec.
Further investigation indicated that addition of redundance in the
structure of the pulse, by means of a delay line, improved the recognizable
character and stationarity of the basic speech code element. The circuitry
used is reported elsewhere in this report.


Our experiments generally supported the hypotheses regarding
speech formation and recognition, permitting prediction of the effects of
processing and also synthesis by single time related systems. The
redundance introduced by continued reverberation appears to be simply
that, and not essential to the basic code. However, we can write an
expression for the reverberant transformation of the vocal tract as
equation (2.5.24), using a minimum number of delay lengths (there are
undoubtedly more of varying significance).


$$H(s) = \sum_{n=1}^{N} \sum_{k=0}^{\infty} a_n^k e^{-skT_n} \qquad (2.5.24)$$


In order to compute, process, or recognize the signal of
equation (2.5.24) it is necessary to truncate the expressed infinite series.
Practically this amounts to terminating the series when the signal to
noise ratio of succeeding terms is so small as to be insignificant. We
may then write equation (2.5.25).

$$H(s) = \sum_{n=1}^{N} \sum_{k=0}^{K} a_n^k e^{-skT_n} \qquad (2.5.25)$$


A  form  of  recognition  function  which  could  be  applied  to  (2.5.25)  is  given 
in  equation  (2.5. 26). 
Application of C(s) from equation (2.5.26) to the signal H(s) of
equation (2.5.25) results in coherent and cross terms, as before, which
can be expressed in equation (2.5.27).

$$H(s)\,C(s) = \left[ \sum_{n=1}^{N} \sum_{k=0}^{K} a_n^k e^{-skT_n} \right] \left[ \sum_{n=1}^{N} \sum_{h=0}^{K} a_n^h e^{-s(KT_n - hT_n)} \right] = \text{coherent terms} + \text{cross terms} \qquad (2.5.27)$$


It is assumed that other means of utilizing reverberation in signal
improvement and recognition can be found, but the existence of one such
function is sufficient to indicate the significance of attention and
recognition processes utilizing reverberation.


2.6 Object Recognition

One of the most provocative of the outcomes of the experiment and
theory in localization has been the ability to apply the theory consistently
to a wide variety of sonic perceptions. It may be observed that localization,
speech, room reverberation and music all fall into the class of "recognition,"
of a place, of a word, of an environment, and of an instrument respectively.
Although man does not seem to make full use of the possibilities, bats and
dolphins extend the recognition to "kind of object" in food gathering and
navigation. Since our work has included interaction with the dolphin, it
is appropriate that the present discussion be oriented in that direction.

We assume that the dolphin pulse provides P(s), measured by
NOTS to be less than 3 microseconds in rise time, since components
in excess of 120 kcps were recorded. If we draw an analogy between
the transformation of the vocal pulse by the vocal tract and the
transformation of the dolphin pulse by an object, we see that the forms of
transformation and recognition functions are identical.

In the case of the dolphin recognition of objects, we can postulate
two kinds of transformation: (1) a multiplicity of internal paths and
reverberation, and (2) a multiplicity of external paths and
reverberation. The forms of the equations remain the same in general,
but it is useful to introduce the idea of "acoustic coloration," or the
effect of materials involved in reflection or transmission on the signal
with respect to the Fourier power density spectrum transformation
resultant. The two effects are then:

1. Acoustic coloration

2. Time distribution

Both effects may be taken into account in the same mathematical
expressions (since they are general), and appear in Laplace transformation
notation as in equation (2.6.1).

$$H(s) = P(s) \sum_{n=0}^{\infty} A_n(s)\, e^{-sT_n} \qquad (2.6.1)$$

$A_n(s)$ = coloration transformation through the nth path

$T_n$ = mean delay through the nth path

The difference between (2.6.1) and previous expressions is the functional
character of the coefficient, which previously was written as a constant,
indicating coloration in (2.6.1) and no coloration previously. The
equation (2.6.1) is the more general expression more compactly expressed,
for A(s) is given by equation (2.6.2)

$$A(s) = \int_{0}^{\infty} a(\tau)\, e^{-s\tau}\, d\tau \qquad (2.6.2)$$

In  equation  (2.6.2)  the  correspondence  (2.6.3)  may  be  drawn 

a  (s  j)  =  (2.6.3) 

Thus a continuous system of delays is implied in (2.6.1) by A(s),
accompanied by a discrete system of delays expressed by $e^{-sT_n}$. The
continuous system accounts for reflection, absorption, and transmission
characteristics of materials. The discrete system accounts for the
arrangement of materials into a structure. These expressions thus apply
to moths (bats), fish (dolphin), and submarines (man).

The recognition relationships also remain the same, except that
orientation of the object in three dimensional space affects the form of
the discrete system. If we observe that any two parts of a structure
have a maximum separation when viewed perpendicular to the line
joining them, we can write orientation expressions in the discrete
system, equation (2.6.4)

$$T_n = T_{Mn} \cos\theta_n, \qquad 0 \le \theta_n \le \frac{\pi}{2} \qquad (2.6.4)$$

$\theta_n$ = angle between the connecting line between two structural
elements and the normal to the sound wave front.

$T_{Mn}$ = maximum delay in the nth path between structural elements
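The orientation relation (2.6.4) can be illustrated numerically. The following is a minimal sketch, not part of the original work: the function names, the example delays, and the evaluation of a plain sum of delayed terms at s = jω are all assumptions made for illustration.

```python
import cmath
import math


def projected_delays(max_delays, angles):
    """Apply T_n = T_Mn * cos(theta_n) to each path, as in equation (2.6.4).

    max_delays: maximum delays T_Mn in seconds (hypothetical values).
    angles: theta_n in radians, each restricted to 0 <= theta_n <= pi/2.
    """
    for theta in angles:
        if not 0.0 <= theta <= math.pi / 2:
            raise ValueError("theta_n must lie between 0 and pi/2")
    return [t_max * math.cos(theta) for t_max, theta in zip(max_delays, angles)]


def delay_sum_response(coeffs, delays, omega):
    """Evaluate sum_n a_n * exp(-s * T_n) at s = j * omega (illustrative only)."""
    return sum(a * cmath.exp(-1j * omega * t) for a, t in zip(coeffs, delays))


# Two hypothetical paths: one seen at maximum separation, one at 60 degrees.
delays = projected_delays([100e-6, 200e-6], [0.0, math.pi / 3])
response = delay_sum_response([1.0, 0.5], delays, omega=2 * math.pi * 1000.0)
```

With the second path viewed at 60 degrees its delay is halved, so both paths arrive 100 microseconds apart from the reference and add coherently in this example.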


It is a trivial observation that in any three dimensional structure not all
$T_n$ can be zero. If we assume a known form transformation, equation (2.6.5),

$$F(s) = \sum_{n=0}^{N} a_n e^{-sT_{Mn}} \qquad (2.6.5)$$

then the rotated form is equation (2.6.7).

$$F_R(s) = \sum_{n=0}^{N} a_n e^{-sT_{Mn}\cos\theta_n} \qquad (2.6.7)$$

$F_R(s)$ = $F(s)$ transformed by rotations

Given a recognition transformation, equation (2.6.8),

$$C(s) = \sum_{n=0}^{N} a_n e^{-s(T_N - T_n)} \qquad (2.6.8)$$

the corresponding rotated form can be given in equation (2.6.9).

$$C_R(s) = \sum_{n=0}^{N} a_n e^{-s(T_N - T_n)\cos\theta_n} \qquad (2.6.9)$$

Unfortunately (2.6.9) may have values of anticipation rather than delay,
making realization of it not possible in a world bound to the present
moment. However, $C_R(s)$ can also be translated in time, as given in
equation (2.6.10).

$$C_{Rc}(s) = e^{-sT_c} \sum_{n=0}^{N} a_n e^{-s(T_N - T_n)\cos\theta_n} \qquad (2.6.10)$$

$C_{Rc}(s)$ = recognition function rotated and translated in time

$T_c$ = time translation of recognition
Next it should be observed that only three distinct rotations
(or angular degrees of freedom) are possible for a real three dimensional
object. The resultant angle of the connecting line on the normal to the
wave front must be a consequence of those three rotations, which can
be expressed in an object framed coordinate system by equation (2.6.11)

$$\cos\theta_n = R(\phi_n, \psi_1, \psi_2, \psi_3) \qquad (2.6.11)$$

$\phi_n$ = angle of structural line relative to the object frame

$\psi_1, \psi_2, \psi_3$ = rotation of the object frame relative to the
observer frame

We can now rewrite equation (2.6.10) as equation (2.6.12)

$$C_{Rc}(s) = e^{-sT_c} \sum_{n=0}^{N} a_n e^{-s(T_N - T_n)\, R(\phi_n, \psi_{1,2,3})} \qquad (2.6.12)$$

where $T_c$ is given by the longest line in the object frame. From
equation (2.6.12) it should be possible to recognize an object at any angle and
also to determine its structure (internal and external) by viewing it from
three different angles (determining $\phi_n$). All of this can be performed
through delays, attenuations, signed additions, and memory. Thus the
acoustical recognition by dolphin, given sufficient signal to noise ratios,
can provide a detailed, three dimensional model of his surroundings and
its contents, or permit him to "see" sonically.


2.7 Localization Experiments

Among the many experiments performed, there are three taken
recently in the interest of establishing the accuracy of the system with
the equipment developed, particularly the headphones. The experiment
was performed by orienting the pinnae horizontally and producing the
sound of a maracas at 16 different positions around the stand. A diagram
of the experimental setup is shown in Figure 2-10. In Figure 2-11a, the
data is presented on a scatter diagram for elevation. The columns headed
"Report" in the figures refer to the reported location by the hearer. The
rows labeled "Place" refer to the actual location of the sound source.

The consistency of the resultant diagram is indicative of relatively
accurate location in elevation. In Figure 2-11b the result of a similar
test in azimuth is presented.

In order to separate room effects from the effects of the pinnae,
a similar test was made using two bare microphones but otherwise
identical equipment. The result is shown in Figure 2-11c.

As is to be expected, the sidedness of the bare microphones is
observable, but front-back and up-down ambiguities (equivalent in such
a test) show strongly on the scatter diagram. In the case of the
microphones with pinnae and the system in use, however, the ambiguities are
tremendously reduced, and the position relationships relatively accurately
reported.

A last note on these experiments: in the case of localization
with pinnae, subjects reported that confident decisions were easily made.
With bare microphones, the expressed subjective confidence in the
decision was considerably less.
CHAPTER 3

EXPERIMENTAL DEVELOPMENTS

by

Roland L. Plante

3.1 Electrostatic Headphone

The importance of fidelity in all components of a localization system
has been emphasized sufficiently in earlier reports on sound localization
(refs. 3.1, 3.2, 3.3, 3.4). Furthermore, it was stated that the only
component limiting the total system bandwidth was the headphone. The initial
work to overcome this limitation involved the use of condenser microphones
(Bruel & Kjaer Model 4135) driven as headphones. Localization with this
improvised headset surpassed all expectations. Certain disadvantages
were evident, however. The sound level output was only 80 db and the
dynamic range was inadequate. A developmental program was therefore
undertaken to provide electrostatic headphones capable of duplicating the
performance of the improvised B&K headset without the serious drawbacks
noted.

The design goal was to produce a single electrostatic transducer
to reproduce frequencies from 40 cycles per second to 20,000 cycles per
second within a tolerance of 1 db at a sound pressure level of 95 db
re .0002 μbar when loaded into a closed volume approximating the ear canal.
While the design reported here does not meet all the proposed parameters,
it does provide a headset which localization tests have shown to be very
effective in aural coupling (see page 2-28). Furthermore, it embodies
significant improvements over designs reported earlier.
A brief summary of important performance specifications follows:

Diaphragm displacement response:   50-25,000 cycles per second ± 3.5 db

Sound pressure in 1 cm³ coupler:   95 db re .0002 μbar at 1 kc
                                   with signal voltage = 20 v RMS

Bias voltage:                      200 v DC

The requirement for a constant amplitude-frequency characteristic
was set by consideration of the microphone placement in the pickup. The
microphone is located at the entrance to the ear canal of the replica ear.
Its diaphragm displacement, and hence the electrical output, is
proportional to the incident acoustic pressure. To reproduce this pressure
proportionately at the ear canal entrance of the listener requires that the
displacement response of the headphone diaphragm be flat. This does not
imply that the pressure at the ear drum will be constant with frequency.
It is, in fact, anything but flat due to the acoustic properties of the ear
canal and ear drum. Earlier design effort concentrated on producing a
constant acoustic power output, which requires a -20 db/decade diaphragm
displacement function. The headset reported here has a power output
which increases with frequency.

The nominal sound pressure level of 95 db re .0002 μbar is
chosen to provide the dynamic range customarily found in headphones.

3.2 Design

The electrostatic headphone element consists of a fixed back plate
electrode and a metallized mylar diaphragm 0.25 mils thick. Mylar
thickness of 0.15 mils was also used. By designing the element as an
insertion type to be located at the entrance of the ear canal, the diaphragm
displacement requirements to produce adequate sound pressure are lessened.
Furthermore, a flat response characteristic may be expected if the
resonant frequency occurs at the upper limit of the desired bandwidth.
Then for frequencies lower than the resonant frequency, the
displacement, and hence the pressure in a closed cavity (dimensions small
compared to wavelength), is constant for constant signal voltage. Since
the resonant frequency is heavily damped, it does not appear in the
characteristic manner. It was measured by noting the frequency at
which 90° phase difference between the drive signal and the diaphragm
displacement occurs.

Sensitivity  of  the  element  depends  on  the  gap  between  the 
diaphragm  and  the  back  plate.  The  gap  is  determined  by  the  tension 
in  the  diaphragm,  the  clearance  machined  into  the  back  plate,  and  the 
polarization voltage applied. In the final configuration the operating
gap is .00015 inches. This was determined by measuring the capacitance
of the element without bias (25 pfd) and with bias (80 pfd). The gap without
bias is .0005 inches.
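The operating gap quoted above follows from the two capacitance readings if the element is treated as an ideal parallel-plate capacitor, whose capacitance varies inversely with the gap. The sketch below makes that assumption explicit; the function name is hypothetical, and stray capacitance and diaphragm curvature are ignored.

```python
def biased_gap(gap_unbiased, cap_unbiased, cap_biased):
    """Estimate the gap under bias from capacitance, assuming C is proportional to 1/gap.

    gap_unbiased: gap without bias (inches)
    cap_unbiased, cap_biased: capacitance without and with bias (same units)
    """
    if cap_unbiased <= 0 or cap_biased <= 0:
        raise ValueError("capacitances must be positive")
    return gap_unbiased * cap_unbiased / cap_biased


# Values reported in the text: .0005 inch gap, 25 pfd unbiased, 80 pfd biased.
operating_gap = biased_gap(0.0005, 25.0, 80.0)
print(f"operating gap = {operating_gap:.5f} inches")
```

Under this idealization the result is about .00016 inches, in agreement with the reported operating gap of .00015 inches.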

3.3 Performance Measurements

The measurements made to specify the performance of the
headphone are: (a) pressure response in a 1 cm³ coupler; (b) diaphragm
displacement response when loaded into a human ear and into the 1 cm³
coupler; (c) sensitivity; (d) second harmonic distortion; (e) phase
response of diaphragm displacement.

Since the headphone element is located at the entrance to the
ear canal, it was decided to use as the coupler one whose volume was
equal to that of a human ear canal and whose shape would minimize
wave length effects. Reference to published data as well as measurement
of our own ear canal volume using water and a graduated hypodermic
syringe yielded a volume of 1 cm³. The shape, Figure 3-1, was selected
from reference 3.5 and scaled to give 1 cm³ cavity volume.


Unfortunately, the exact required shape to give a flat coupler characteristic
was not easily duplicated. One reason was the necessity for retaining
the microphone protector cap. This resulted in a cavity which was shaped
as shown in Figure 3-2 because the diaphragm of the microphone is recessed.


Figure 3-2. Actual coupler shape caused by retaining
protector cap.


Therefore, the frequency characteristic was measured using
Bruel & Kjaer 4134 microphones as both the transmitter and the receiver.
When this characteristic is subtracted from the pressure response curve
of the headphone in the 1 cm³ coupler, the displacement response is
obtained. Figure 3-3 shows the coupler characteristics.

Figure 3-3. 1 cm³ coupler characteristic.

The diaphragm displacement was measured using the circuit
shown in Figure 3-4. Operation of the unit is based on varying the capacity
of a circuit tuned to 21.4 mc. The output is a voltage linearly
proportional to the displacement of the diaphragm over the range of interest.
Figure 3-5 is the diaphragm displacement response when loaded into
the 1 cm³ coupler and into a human ear canal.

Sensitivity was determined using the 1 cm³ coupler and a
B&K 4134 microphone. By measuring the microphone output and knowing
its sensitivity, headphone sensitivity is easily calculated. Figure 3-6
is the pressure response in the coupler. SPL at 1 kc is 95 db re .0002 μbar.

Second harmonic distortion of the microphone output in the 1 cm³
coupler was measured using a Hewlett-Packard Wave Analyzer Model
302A. Figure 3-7 shows the measured distortion at 90 db SPL. The
distortion shown is due to mechanical and electrical factors. The latter
may be calculated.

The electrostatic force on the diaphragm may be written

$$F \propto (E_b + E_s)^2 \qquad (3.3.1)$$

$E_b$ = polarization voltage

$E_s$ = signal voltage

For a sinusoidal input signal

$$F \propto (E_b + E_p \sin \omega t)^2 \qquad (3.3.2)$$

$E_p$ = peak voltage

Expanding equation (3.3.2) and substituting known identities gives
equation (3.3.3)

$$F \propto \left(E_b^2 + \frac{E_p^2}{2}\right) + 2 E_b E_p \sin \omega t - \frac{E_p^2}{2} \cos 2\omega t \qquad (3.3.3)$$

The percent second harmonic distortion is

$$\%\ \text{distortion} = \frac{E_p}{4 E_b} \times 100 \qquad (3.3.4)$$

Figure 3-5. Diaphragm displacement response.

At 90 db SPL, $E_p$ = 10 v and $E_b$ = 200 v. Therefore, the
electrical distortion is 1.25%.
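The 1.25% figure can be checked directly from equation (3.3.4), and the expansion in equation (3.3.3) can be checked numerically by projecting the squared drive voltage onto the fundamental and the second harmonic. The sketch below is illustrative only; the function names and the sample count are assumptions.

```python
import numpy as np


def second_harmonic_percent(e_peak, e_bias):
    """Percent second harmonic distortion from equation (3.3.4): 100 * E_p / (4 * E_b)."""
    return 100.0 * e_peak / (4.0 * e_bias)


def harmonic_ratio_numeric(e_peak, e_bias, samples=4096):
    """Project (E_b + E_p sin wt)^2 onto sin(wt) and cos(2wt) over one full period."""
    wt = 2.0 * np.pi * np.arange(samples) / samples
    force = (e_bias + e_peak * np.sin(wt)) ** 2
    fundamental = 2.0 * np.mean(force * np.sin(wt))        # equals 2 * E_b * E_p
    second = -2.0 * np.mean(force * np.cos(2.0 * wt))      # equals E_p**2 / 2
    return 100.0 * second / fundamental


print(second_harmonic_percent(10.0, 200.0))
print(harmonic_ratio_numeric(10.0, 200.0))
```

Both routes give the same ratio for the values quoted in the text, which also shows why raising the bias voltage lowers the electrical distortion: the ratio falls in inverse proportion to E_b.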

Phase  measurements  were  made  with  a  Tektronix  Dual  Beam 
Oscilloscope,  Model  502.  Figure  3-8  is  the  measured  phase  response. 
Note  that  90°  phase  shift  occurs  at  19  kc,  the  resonant  frequency. 

3.4  Headphone  Construction 

The configuration of the final element design is shown in
Figures 3-9, 3-10, 3-11 and 3-12. With the exception of the relatively
high distortion level (which can be decreased by increase in bias
voltage or by increase in sensitivity), the headset shown is eminently
suitable to the needs of sound localization.

Proper  coupling  of  the  headphone  output  to  the  ear  canal  required 
the  development  of  suitable  ear  plugs  as  shown  in  Figure  3-12. 

While  an  under-the-chin  arrangement  is  shown,  the  elements 
were  also  adapted  to  circumaural  muffs,  in  which  case  each  element 
is  spring  loaded  within  the  muff. 

Some attempts were made to permanently polarize the mylar film
so as to eliminate the need for external bias (ref. 3.6). Within the limits
of our work, we found that the applied signal voltage tended to reduce
the trapped charge, thus reducing the effective bias and increasing the
signal distortion. It is recommended that this technique be explored
more fully as a means of eliminating what is commonly accepted as the
chief drawback to electrostatic headphones, namely, the polarization
voltage.

Figure 3-9. Exploded Photo of Element


Figure 3-10. Assembled Element

3.5 Headphone Drivers

The amplifier/driver for electrostatic headphones has been
discussed in reference 3.3. Essentially, this amplifier combines a
preamplifier function with the drive function supplying the required signal
voltage swing and D.C. polarization.

To make the headphone more compatible with conventional drive
systems, a simple circuit, Figure 3-13, was put together and tested.
This circuit provides compatibility between conventional audio outputs
and the electrostatic headphone.

A third method which was studied and tested is shown in Figure 3-14.
Here the bias is provided by a capacitor charged to the required voltage
by means of the signal input. This driver eliminates the need for a biasing
battery and provides compatibility with standard audio outputs.

3.6 Acoustic Coloration

Sounds reproduced from the binaural pick-up, Figure 3-15, are
subjectively rich in the mid-frequencies. While this does not appear
to seriously affect the localization transfer capability, it was felt that
a closer reproduction of naturalness would further improve the coupling.

Experiments were designed to study the differences between
human and cast replica ears. A test panel was constructed which
allowed a subject to essentially put his ear on a board, Figure 3-16.
The cast replica could be mounted in the same way, Figure 3-17. The
purpose of the panel was to create equal acoustic conditions surrounding
the ear. A speaker was excited with wide band noise and a probe tube
adapted to a Bruel & Kjaer 4134 microphone. Data was recorded on a
B&K Level Recorder fed by a B&K spectrum analyzer.
Two important measurements were made:

1. The human ear and replica ear were plugged with clay at the
canal entrance and probe tube measurements were made on each. No
significant difference was noted. It was concluded that the material
selected for the ear replica was acoustically satisfactory.

2. The probe tube was placed at the entrance of an open human
ear canal and data recorded. The probe was then located at the same
position in the replica ear with the microphone in its normal position,
i.e., at the entrance to the ear canal. A distinct difference was
measured as shown in Figure 3-18. Emphasis of the range between
2 kc and 7 kc is evident. It appears that the energy in this band is
naturally coupled to the ear drum. This obviously does not occur in the
replica ear-microphone adaptation. Further measurements were made
with the microphone in different places without any real improvements.

Two problems remain in the pick-up. One is the proper placement
of the microphone which will produce the same characteristics as
measured on a human ear and a replica ear. The second is improved
acoustic isolation of the microphone in the pick-up for frequencies
below 500 cps. Neither of these problems is insurmountable.
Continued study should produce a pickup which differs very little from
the mechanical function of the external ear.
Figure 3-18. Effects of microphone location in binaural pickup.
White noise measured at the entrance of ear canal with
B&K 4134 microphones with probe. Curves shown are
the variations around probe tube/noise characteristics.
CHAPTER 4

INVESTIGATION OF THE RECOGNITION FACTORS
OF HUMAN SPEECH

by

Richard H. Spencer


4.1 Introduction

The theory presented in the preceding chapter was developed from
observations and experiments in auditory localization. Subsequently, it
was used as a basis for investigating the recognition factors of human
speech. The application of the theory to speech work has provided a
new approach to old problems which have yielded little, if any, to the
traditional methods of analysis and experiment.

Interspecies communication between man and porpoise has been
discussed in earlier reports. Under this contract, further improvements
were made in the translators and a meaningful test program was conducted
at Point Mugu, California. Much work remains to be done in this area;
however, it is rewarding that the initial efforts based on the new concepts
have resulted in rapid and significant progress.

4.2 Pitch Extractor

The processing of speech for computer analysis, analysis of the
significance of various parts of the vocal pulse train, digitization of
speech sounds and the processing of speech for speaker identification
clues all frequently require that the start of each voice pulse train be
identified.
Work in the above areas of speech processing and study has in
the past relied on several methods of identifying the start of a vocal
pulse train: (a) use of a discriminator to identify a predetermined level
of signal; (b) use of single or double differentiation of the speech
waveform prior to a discriminator; (c) visual identification from a high
speed recording of the speech waveform. The above methods suffer from
slowness or from uncertainty, particularly if the signal intensity varies.

It has been found that by processing speech waveforms with a
special form of nonlinear circuit a positive identification of the start of
each vocal pulse train is produced. The use of this processing circuit
makes possible quicker, easier and more reliable identification of the
start of the vocal pulse train than is had with the prior methods.

Figure 4-1 illustrates a block diagram of the system by which the
identification of the start of the vocal pulse train is made.

A speech waveform is fed to two peak detectors: one a positive
peak detector, the other a negative peak detector. These two detectors
have carefully selected decay times. The outputs of the two detectors
are weighted and summed. The resultant waveform is a single pulse,
the leading edge of which identifies the start of the vocal pulse train as
indicated in Figure 4-2.

Performance of the pitch extractor is improved by the addition of
a trigger circuit activated by the output waveform. In addition, pre-filtering
of the incoming speech waveform by a low-pass filter provides more
certitude in the output response.
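The detector arrangement described above can be sketched in discrete time. This is only an illustration under stated assumptions, not the circuit of Figure 4-1: the report does not give the decay times, weights, or trigger level, so the values below (decay per sample, unit weights, threshold, hold-off) and all function names are hypothetical, and the test waveform is a synthetic damped oscillation rather than speech.

```python
import numpy as np


def peak_detector(x, decay, polarity):
    """Peak-hold detector with exponential decay; polarity +1 or -1 selects the side."""
    out = np.zeros(len(x))
    held = 0.0
    for i, v in enumerate(x):
        held = max(polarity * v, held * decay)  # charge instantly, discharge slowly
        out[i] = held
    return out


def pulse_starts(x, decay=0.99, w_pos=1.0, w_neg=1.0, threshold=0.5, holdoff=20):
    """Weight and sum the two detector outputs, then trigger on a sharp rise."""
    combined = w_pos * peak_detector(x, decay, +1.0) + w_neg * peak_detector(x, decay, -1.0)
    rise = np.diff(combined, prepend=0.0)
    marks, last = [], -holdoff
    for i, r in enumerate(rise):
        if r > threshold and i - last >= holdoff:  # simple trigger with hold-off
            marks.append(i)
            last = i
    return marks


def synthetic_voiced(n=500, period=100):
    """Damped oscillation restarted every `period` samples (stand-in for vocal pulses)."""
    m = np.arange(n) % period
    return np.exp(-m / 15.0) * np.cos(2.0 * np.pi * m / 12.5)


marks = pulse_starts(synthetic_voiced())
```

On the synthetic waveform the trigger fires once per period, at the sample where each damped oscillation restarts. With real speech the decay times and weights would have to be chosen with care, as the text notes.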

Figure 4-1. Block Diagram of Pitch Extractor


Figure 4-2. Waveforms of Pitch Extractor

4.3 Gating of Speech Waveforms

In work directed toward determining the significant characteristics
of speech, an experiment was set up which permitted the selective gating
of various parts of the vocal pulse train. In this work it was possible
to select a given interval of time at a predetermined time from the vocal
pulse for gating through of the vocal pulse train.

Several  tentative  conclusions  are  made  on  the  basis  of  this  work. 

(1)  With  a  normal  10  millisecond  pitch  period  any  3  millisecond  interval 
of  "on"  time  produced  intelligible  output. 

(2)  Operation  with  single  or  double  differentiation  of  the  incoming 
speech  waveform  produced  better  performance  than  was  shown  for  unprocessed 
speech. 

(3)  If  the  gate  was  programmed  to  operate  on  alternate  vocal  pulse 
trains,  the  subjective  interpretation  was  that  the  speaker  had  slowed 
down  his  speech. 

4.4  Delay  Line  Synthesizer 

The 100 section, 1000 microsecond delay previously constructed
for NOTS was modified by the addition of 4 compensated level restoring
amplifiers and 5 summing amplifiers of controllable gain and delay
positioning. These modifications provided an instrument for use in analysis
and synthesis of complex waveforms.

4.5  Rasp  Generator 

The  Rasp  Generator  is  a  device  which  converts  human  speech  into 
a  special  form  believed  useful  in  communicating  with  porpoises.  Input 
to  the  device  is  supplied  to  an  AKG  capacitor  microphone  and  preamp. 

The resulting electrical waveform is amplified, double differentiated in
a delay line circuit, amplified again, then fed to a Schmitt trigger. The
resulting waveform is available directly, integrated once, and integrated
twice. The instrument is portable and operated from rechargeable batteries.

A  circuit  diagram  for  this  instrument  is  shown  in  Figure  4-3. 
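
The signal chain of the Rasp Generator can be imitated in discrete time.
This is a rough sketch under stated assumptions, not a model of the circuit
in Figure 4-3: double differentiation is approximated by two
delay-and-subtract stages, the integrators by running sums, and the
sampling rate, threshold and function names are hypothetical.

```python
import math


def second_difference(x):
    """Approximate double differentiation by two delay-and-subtract stages."""
    first = [b - a for a, b in zip(x, x[1:])]
    return [b - a for a, b in zip(first, first[1:])]


def schmitt_trigger(x, threshold):
    """Two-level output with hysteresis: switches high above +threshold,
    low below -threshold, and otherwise holds its previous state."""
    state, out = -1.0, []
    for v in x:
        if v > threshold:
            state = 1.0
        elif v < -threshold:
            state = -1.0
        out.append(state)
    return out


def integrate(x):
    """Running sum, standing in for an analog integrator."""
    total, out = 0.0, []
    for v in x:
        total += v
        out.append(total)
    return out


fs = 40_000                                   # assumed sampling rate
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(400)]
direct = schmitt_trigger(second_difference(tone), threshold=0.005)
once = integrate(direct)                      # "integrated once" output
twice = integrate(once)                       # "integrated twice" output
```

The three lists correspond to the three outputs named in the text: the
trigger output directly, integrated once, and integrated twice.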
4.6 Headset Amplifier

A headset of prior design and construction was modified to improve
the gain and phase response. The amplifier is a two-channel, portable,
battery-operated unit. The set is intended to be fed from hydrophones with
output to a pair of headphones. The enclosure is water resistant, and the
batteries are provided with a built-in recharging circuit. The following
specifications are met.


Number of channels:              2
Voltage gain:                    variable to 700
Bandwidth (see Figure 4-4):      20 cps to 70 kcps
Phase shift (see Figure 4-5):
    +10° to -10°                 100 to 15,000 cps
    +20° to -20°                 50 to 25,000 cps
    +40° to -40°                 30 to 50,000 cps
Input resistance:                4000 ohms (midband)
Output resistance:               300 ohms
Output swing:
    open circuit:                3 volts rms
    470 ohm load:                2.5 volts rms
Noise referred to input:
    input open:                  4 microvolts rms
    input shorted:               1 microvolt rms
Crosstalk at 1000 cps:
    left to right:               -47 db
    right to left:               -43 db
Batteries:                       2 CD28 nickel cadmium rechargeable
                                 units, .225 ampere hour
Recharge:                        22.5 milliamperes per battery by
                                 transformer and rectifier operated
                                 from line voltage (14 hours for
                                 full recharge)
Drain:                           30 milliamperes per battery
Operating time between charging: 7 hours
Input connectors:                XLR 3-13 connector
Output connectors:               microphone dual jacks (fit small
                                 plugs only)
Recommended headphones:          AKG - 400 ohms
Circuit diagram:                 See Figure 4-6
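
The battery figures in the specification can be cross-checked with simple
arithmetic. The short calculation below uses only the listed capacity,
drain and recharge values; the variable names are illustrative, and the
interpretation of the results (for example, that the stated 7 hours leaves
a small margin below the ideal figure) is an inference, not a statement
from the report.

```python
CAPACITY_AH = 0.225        # per battery, from the specification
DRAIN_A = 0.030            # per battery, from the specification
CHARGE_A = 0.0225          # per battery, from the specification
CHARGE_HOURS = 14          # from the specification

ideal_run_hours = CAPACITY_AH / DRAIN_A                # 7.5 hours
charge_delivered_ah = CHARGE_A * CHARGE_HOURS          # 0.315 ampere hour
overcharge_ratio = charge_delivered_ah / CAPACITY_AH   # 1.4
```

The ideal running time of 7.5 hours is consistent with the stated 7 hours
between charges, and the 14 hour recharge delivers about 1.4 times the
rated capacity.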


4.7 Investigation of Torsional Delay Lines

Two commercial models of sonic torsional delay lines were investigated
as a means to achieve multiple delay and gain synthesis of complex
signals.

Pulsed sine wave excitation of the lines was used. Output was
observed on a CRO. A single delay magnet was installed to produce one
delayed signal. The signal produced by this magnet was clearly discernible
and could be positioned at any delay within the range of the line. (Induced
voltage mode of line operation.)

Two  serious  drawbacks  are  evident  for  the  proposed  application. 

(1)  The  signal  level  produced  by  the  delay  magnet  is  no  more  than 
10  times  the  background  noise  level  even  when  narrow  band  amplification 
is  used. 

(2) Undesired residual signals occur. These signals are apparently
caused by remanent fields left in the torsional line by previous application
of a delay magnet. These signals are as much as 1/3 the amplitude of the
signal produced by a delay magnet.

Work  was  suspended  on  this  type  of  synthesizer  because  of  these 
two  drawbacks. 
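
The two drawbacks can be restated in decibels. The conversion below assumes
that both figures are amplitude (voltage) ratios; the text states this
explicitly only for the residual signals, so the signal-to-noise value is
an assumption. The function name is illustrative.

```python
import math


def amplitude_ratio_db(ratio):
    """Express an amplitude ratio in decibels."""
    return 20.0 * math.log10(ratio)


snr_db = amplitude_ratio_db(10.0)            # signal 10 times the noise
residual_db = amplitude_ratio_db(1.0 / 3.0)  # residual relative to the signal
```

Under this assumption the delayed signal is at most 20 db above the
background noise, and the residual signals are only about 9.5 db below the
wanted signal.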
Figure 4-5 Headset Amplifier: Phase Response

Fig. 4-6 Circuit Diagram of Headset Amplifier (capacitance in microfarads)
4.8 Experimental Voice Synthesizer

An early model voice synthesizer which was originally constructed
in haywire form was reconstructed into a neat package for use in experiments
on phoneme synthesis. The circuit diagram for this device is shown in
Figure 4-7.

4.9 Experiments with Voice Boxes

A  number  of  experiments  were  conducted  in  attempts  to  determine 
the  significant  variations  that  occur  in  the  human  vocal  apparatus  in  the 
production  of  speech  sounds.  Some  of  these  experiments  were  concerned 
with  the  production  of  synthetic  speech  sounds  by  means  of  artificial  vocal 
cavities.  These  artificial  vocal  cavities  were  constructed  of  wood  and 
each  had  a  single  movable  control  piece.  Tests  were  conducted  by  feeding 
a  short  acoustic  pulse  100  times  per  second  into  the  box  and  listening  to 
the  sound  emitted  from  an  opening  in  the  box.  The  sound  resulting  in  these 
experiments  did  show  phoneme-like  character.  As  in  previous  experiments 
on  the  generation  of  synthetic  speech  sounds  it  was  found  that  dynamics 
play an important part. If the control piece is moved continuously between
its  limits  it  is  much  easier  to  identify  each  sound  with  a  particular  phoneme 
than  it  is  with  the  control  piece  in  a  fixed  position.  Further,  the  sounds 
have  a  monotonous  character;  variation  of  the  repetition  rate  and  the  exciting 
pulse  would  probably  remove  this  undesirable  character. 

4.10 Photoelectric Waveform Synthesis

A series of experiments was conducted in an investigation of means
of producing phoneme-like sounds. Three separate photoelectric techniques
were explored in this work.

(1) Use of a template placed in front of a CRO screen together
with a viewing photocell and feedback to the CRO. This scheme causes
the CRT beam to position itself at the edge of the template. If the CRT
beam is swept by the horizontal deflection circuit of the CRO, the beam
follows the edge of the template. The signal produced by the photocell
is a replica in time of the waveform represented by the template. The
template used was made up of 20 rods, each individually adjustable.

(2) Use of an opaque mask on a CRO screen together with a viewing
photocell. In this arrangement, with the CRO beam swept across the
screen by the deflection circuitry of the CRO, the light reaching the
photocell would be blanked out as the beam passed under opaque regions
of the mask. The photocell output thus is a signal of one of two values.

(3) Use of a graded mask. A graded mask was made in much the
same way that audio signals are recorded in motion picture films. The
amplitude-time history of a typical phoneme waveform during one pitch
period was recorded from an actual speaker. This amplitude-time history
was translated into film density versus distance along the film by a
photographic process. The graded mask so made was placed in front of a
CRO screen. When the CRT beam was caused to sweep across the CRT,
the light transmitted by the mask to a photocell caused the photocell
output to reproduce the amplitude-time history of the original speaker.

Several conclusions were reached from this experimental work. The
subjective impression of sounds obtained by feeding headphones or
speakers with the waveforms generated as described above was substantially
the same as for sounds generated by the methods discussed in other
sections of this report. The sounds are monotonous. Discrimination among
phoneme sounds is greatly improved by dynamics, that is, by changing from
sound to sound rather than letting one sound be continuously repeated.
No single method produced a significantly better sound than any of the
others. As is pointed out in the theory section, some randomness and
variability in the repetition rate and the fine structure of the phoneme
waveform is required to provide the quality of realism.
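
Techniques (1) and (2), and the remark on variability in the repetition
rate, can be illustrated with a short numerical sketch. Everything below is
hypothetical: the rod heights, the use of linear interpolation between
rods, the sample counts, and the random silent gap used to vary the
repetition period are assumptions chosen only to show the idea.

```python
import random


def template_waveform(rod_heights, samples_per_sweep):
    """Technique (1): one sweep along the template edge, with linear
    interpolation between adjacent rods."""
    last = len(rod_heights) - 1
    wave = []
    for n in range(samples_per_sweep):
        pos = n * last / (samples_per_sweep - 1)    # position in rod units
        left = min(int(pos), last - 1)
        frac = pos - left
        wave.append((1 - frac) * rod_heights[left]
                    + frac * rod_heights[left + 1])
    return wave


def binary_mask_waveform(opaque, samples_per_sweep):
    """Technique (2): the photocell sees light (1.0) or no light (0.0)."""
    return [0.0 if opaque[n * len(opaque) // samples_per_sweep] else 1.0
            for n in range(samples_per_sweep)]


def repeat_with_jitter(wave, sweeps, max_gap, seed=0):
    """Repeat a sweep, adding a random silent gap so the period varies."""
    rng = random.Random(seed)
    out = []
    for _ in range(sweeps):
        out.extend(wave)
        out.extend([0.0] * rng.randint(0, max_gap))
    return out


rods = [0.0, 0.9, 0.5, -0.3, -0.7, -0.2, 0.4, 0.6, 0.1, -0.4,
        -0.5, -0.1, 0.3, 0.3, 0.0, -0.2, -0.2, 0.0, 0.1, 0.0]   # 20 rods
sweep = template_waveform(rods, samples_per_sweep=200)
two_level = binary_mask_waveform([False, True], samples_per_sweep=4)
train = repeat_with_jitter(sweep, sweeps=5, max_gap=20)
```

The fixed sweep repeated without a gap corresponds to the monotonous sound
reported; the jittered train is one simple way to introduce the variability
in repetition rate that the conclusions call for.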
4.11 Double Differentiator

A double differentiator circuit for speech signals was designed and
constructed. The circuit is similar to that used in the Rasp Generator.
A 12 db per octave slope is shown from 300 cps to 30 kc. The circuit is
shown in Figure 4-8.
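
The 12 db per octave figure follows from the behavior of an ideal double
differentiator, whose gain is proportional to the square of frequency. The
calculation below is for that ideal case only and says nothing about how
closely the actual circuit follows it; the function name and the choice of
300 cps as reference are illustrative.

```python
import math


def double_differentiator_gain_db(freq_cps, ref_cps=300.0):
    """Gain of an ideal double differentiator relative to ref_cps.

    Each differentiation multiplies amplitude by frequency, so two in
    cascade give a gain proportional to frequency squared.
    """
    return 20.0 * math.log10((freq_cps / ref_cps) ** 2)


octave_step_db = (double_differentiator_gain_db(600.0)
                  - double_differentiator_gain_db(300.0))   # about 12.04 db
span_db = double_differentiator_gain_db(30_000.0)           # 300 cps to 30 kc
```

Doubling the frequency raises the ideal gain by about 12 db, and over the
two decades from 300 cps to 30 kc the ideal gain rises by 80 db.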

4.12  Three-Element  Synthesizer 

An electronic speech sound synthesizer was designed and constructed.
The circuit diagram of Figure 4-9 illustrates the functions of the
synthesizer. The repetition rate oscillator is an unsymmetric square wave
generator of voltage controllable frequency. The output of this oscillator
starts each of the interval oscillators in synchronism with the repetition
rate oscillator and keeps these oscillators operating during the positive
part of the square wave. These interval oscillators are also voltage
controllable so that the interval varies in accordance with the control
signal.

The repetition rate oscillator also feeds the three exponential
envelope generators. Each generator produces a decaying exponential
wave of manually adjustable time constant. These exponential envelopes
are generated in synchronism with the output of the repetition rate
oscillator.

The output of each exponential generator is mixed with the output
of an interval oscillator. The resulting waveform is an exponentially
decaying square wave pulse.

Outputs  of  the  three  signal  mixers  are  summed  with  adjustable 
gain  into  a  common  output. 

Fig. 4-8 Circuit Diagram of Double Differentiator

The net output is a complex wave made up of three exponentially
decaying square wave trains repeatedly generated at a voltage programmable
repetition frequency. The intervals of the interval oscillators are each
voltage programmable and the decay time constants may be manually
adjusted.
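
The structure of the three-element synthesizer can be summarized in a
short numerical sketch. It is an illustration under assumptions, not a
model of the circuit of Figure 4-9: mixing is taken to be multiplication,
the "interval" is taken to be the half-period of each interval oscillator,
and the sampling rate and all settings are hypothetical.

```python
import math


def three_element_period(fs, rep_hz, on_fraction, elements):
    """One repetition period of the summed output.

    elements: list of (interval_s, time_constant_s, gain), one per branch.
    interval_s is taken here as the half-period of the interval oscillator's
    square wave (an assumption; the text does not define it precisely).
    """
    n_period = round(fs / rep_hz)
    n_on = round(on_fraction * n_period)      # positive part of the rep-rate wave
    out = []
    for n in range(n_period):
        t = n / fs
        total = 0.0
        if n < n_on:                          # interval oscillators run only here
            for interval_s, tau_s, gain in elements:
                half_cycles = int(t / interval_s)
                square = 1.0 if half_cycles % 2 == 0 else -1.0
                envelope = math.exp(-t / tau_s)   # restarts every period
                total += gain * square * envelope
        out.append(total)
    return out


fs = 40_000                                   # assumed sampling rate
elements = [(0.0010, 0.004, 1.0),             # illustrative settings only
            (0.0004, 0.003, 0.6),
            (0.0002, 0.002, 0.3)]
period = three_element_period(fs, rep_hz=100.0, on_fraction=0.6,
                              elements=elements)
waveform = period * 10                        # ten identical periods
```

Changing rep_hz stands in for the voltage programmable repetition
frequency, the interval values for the voltage programmable intervals, and
the time constants and gains for the manual adjustments.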

4.13  Phase-Lock  Whistle  Trackers 

One of the problems met in the porpoise translator work has been
the presence in the water of sounds other than the desired porpoise
whistles. Porpoise rasps and clicks and other spurious sounds frequently
cause undesired response from the porpoise-man translators.

It appeared promising to design, construct and test whistle tracking
circuits which would eliminate much of the undesired background noise.
To this end two tracking circuits were designed, fabricated and tested.
Both of these circuits operate as phase locked tracking loops. In a phase
lock loop, the frequency of an oscillator is controlled such that it is
the same as that of an input signal. This synchronism is
obtained by a phase detector fed by the oscillator and by the incoming
signal. The output of this detector is a signal which is proportional to
the phase difference between the two signals. This phase "error" signal
is filtered, then used to control the frequency of the variable oscillator.
Once synchronism has been obtained, the variable oscillator automatically
tracks the input signal in frequency.
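
The loop action just described can be demonstrated with a simplified
simulation. This is a sketch under stated assumptions and not a model of
either tracker circuit: the phase detector is taken to be sinusoidal, the
loop filter is reduced to a plain gain (a first-order loop), and the
function name, loop gain and frequencies are hypothetical.

```python
import math


def run_loop(f_in_cps, f_center_cps, gain_cps, fs=200_000, duration_s=0.02):
    """First-order phase lock loop; returns the final VCO frequency.

    The phase detector output is sin(phase difference); the loop filter is
    reduced to a plain gain (an assumption made to keep the sketch short).
    """
    dt = 1.0 / fs
    phase_in = phase_vco = 0.0
    f_vco = f_center_cps
    for _ in range(int(duration_s * fs)):
        error = math.sin(phase_in - phase_vco)       # phase detector
        f_vco = f_center_cps + gain_cps * error      # error steers the VCO
        phase_in += 2.0 * math.pi * f_in_cps * dt
        phase_vco += 2.0 * math.pi * f_vco * dt
    return f_vco


locked = run_loop(f_in_cps=10_000, f_center_cps=9_500, gain_cps=2_000)
unlocked = run_loop(f_in_cps=10_000, f_center_cps=4_000, gain_cps=2_000)
```

When the input lies within the loop gain of the VCO rest frequency, the VCO
settles on the input frequency. When the input is far from the rest
frequency the loop never synchronizes, which is the acquisition problem
that arises with whistles of unknown frequency.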

Fig. 4-9 Circuit Diagram of Three Element Synthesizer

The difficulty associated with using a phase lock tracker on
porpoise whistles is that of obtaining the initial synchronism. This
problem is overcome in the circuits constructed by providing a continuous
sweep of the voltage controlled oscillator (VCO) such that it ranges from
4 kc to 16 kc 20 times per second. When the VCO frequency is the same as
the incoming (porpoise whistle) frequency, synchronism occurs and
the sweep action is automatically overridden. This mode of operation
means that there is a delay time between onset of a porpoise whistle and
the time that the phase lock circuit locks onto the porpoise whistle.
A further delay occurs because it is desirable to gate off the output signal
until lock-on occurs. This function is obtained by using the sweep
frequency signal occurring at the phase detector output to operate a gate.
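
The size of the acquisition delay can be estimated from the sweep figures
alone. The calculation below assumes a linear rising sawtooth sweep, a
whistle of constant frequency, and instantaneous lock when the sweep
reaches the whistle frequency; none of these details is given in the text,
and the function name is illustrative.

```python
SWEEP_LOW_CPS = 4_000.0
SWEEP_HIGH_CPS = 16_000.0
SWEEPS_PER_SECOND = 20.0
SWEEP_PERIOD_S = 1.0 / SWEEPS_PER_SECOND      # 50 milliseconds


def lock_delay_s(whistle_cps, onset_s):
    """Time from whistle onset until the sweeping VCO reaches its frequency.

    Assumes a linear rising sawtooth that starts at SWEEP_LOW_CPS at t = 0
    and a whistle of constant frequency; lock is taken to be instantaneous.
    """
    if not SWEEP_LOW_CPS <= whistle_cps <= SWEEP_HIGH_CPS:
        raise ValueError("whistle lies outside the swept range")
    fraction = (whistle_cps - SWEEP_LOW_CPS) / (SWEEP_HIGH_CPS - SWEEP_LOW_CPS)
    crossing_s = fraction * SWEEP_PERIOD_S     # crossing time within each sweep
    phase_s = onset_s % SWEEP_PERIOD_S         # where the sweep is at onset
    return (crossing_s - phase_s) % SWEEP_PERIOD_S


delay = lock_delay_s(10_000.0, onset_s=0.0)    # whistle begins as a sweep starts
late = lock_delay_s(10_000.0, onset_s=0.030)   # whistle begins after the crossing
```

Under these assumptions the delay is always less than one sweep period of
50 milliseconds; a whistle that begins just after the sweep has passed its
frequency waits nearly a full period.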

Figures 4-10 and 4-11 are circuit diagrams of the two trackers;
one produces a sine wave output, the other a square wave. In the square
wave circuit the VCO is a square wave oscillator, while in the sine wave
circuit a beat frequency oscillator is used.

Tests showed better performance for the square wave system.
This better performance is attributed to the better dynamic behavior of
the VCO used. That is, the bandwidth, for control purposes, of the
square wave oscillator is wider and hence lends itself better to use in
closed loop operation.

4.14  Mod  III  Translators 

Based on the earlier work on man-to-porpoise and porpoise-to-man
translators, it was decided to design and construct a new translator
system. This new translator would overcome some of the deficiencies of
the prior system and some of the awkwardness of the prior system operation.
The objectives sought in this work were:

(1) A self-contained system. All components, preamps, power
amplifiers, batteries, etc. would be in one package.

(2) The instrument would be rechargeable battery operated.

(3) Improvements would be made in the translators based on the
prior experience.

Unfortunately, it was necessary to suspend work on this project
prior to completion. However, the essential circuits have been designed,
constructed and debugged. Remaining to be done is the design and
construction of the battery charging circuits and the packaging of the
entire system.

Figures  4-12  to  4-16  are  circuit  diagrams  of  the  translators. 

4.15 Tests were conducted at Pt. Mugu in late August using the equipment
described in reference 3.4. A meta-language was developed, the
purpose of which is to permit man vocalizations to be translated to what
is considered meaningful modulated whistles. The word list to be used
in interaction studies was finalized, as follows:

     1. bieib      5. beaeb      9. baiab
     2. biaib      6. beaib     10. baieb
     3. bieab      7. beieb     11. baeab
     4. biaeb      8. beiab     12. baeib

Tapes made show well defined rapid sonic response to input vocalizations.
Initially, two words, "beieb" and "beiab", were used and imitation of each
obtained from the dolphin. The usage by the dolphin was not systematic.
Subsequently, interaction was conducted in three words, "beieb," "beiab"
and "baiab". Imitation was obtained for the first two but not for "baiab".
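
The twelve-word list above appears to follow a simple rule: each word is a
"b", three vowels drawn from i, e and a with no vowel repeated in adjacent
positions, and a closing "b". The report does not state this rule; it is an
observation about the list, and the function name below is hypothetical.

```python
from itertools import product

VOWELS = "iea"


def meta_words(vowels=VOWELS, frame="b"):
    """All frame + three vowels + frame words with no vowel repeated
    in adjacent positions."""
    words = []
    for trio in product(vowels, repeat=3):
        if trio[0] != trio[1] and trio[1] != trio[2]:
            words.append(frame + "".join(trio) + frame)
    return words


words = meta_words()
```

Three choices for the first vowel and two for each of the next give
3 x 2 x 2 = 12 words, which matches the length of the finalized list.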

A review of the tapes made at Pt. Mugu led to the consideration
of the following letters for present and future use in the meta-language:
b, e, a, i, w, y, i. The verification word used in the tests should now be
spelled "biyib" and the negation word spelled "bayal". Tentative
constructions for new words, subject to laboratory tests, include the
following:


     1. raysb      5. waeb       9. yareb
     2. rib        6. wib       10. yib
     3. raib       7. wayeb     11. yarib
     4. arib       8. awib      12. ayib
Fig. 4-10 Circuit Diagram of Square Wave Porpoise Whistle Tracker

Fig. 4-11 Circuit Diagram of Sine Wave Porpoise Whistle Tracker

Fig. 4-12 Circuit Diagram of Mod III Man-Porpoise Signal Processor

Fig. 4-13 Circuit Diagram of Mod III Man-Porpoise Voice Interval Extractor

Fig. 4-14 Circuit Diagram of Mod III Man-Porpoise BFO

Fig. 4-15 Circuit Diagram of Mod III Porpoise-Man Translator